Customizing A Lexicon To Better Suit A Computational Task
Abstract
We discuss a method for augmenting and rearranging a structured lexicon in order to make it more suitable for a topic labeling task, by making use of lexical association information from a large text corpus. We first describe an algorithm for converting the hierarchical structure of WordNet [13] into a set of flat categories. We then use lexical cooccurrence statistics in combination with these categories to classify proper names, assign more specific senses to broadly defined terms, and classify new words into existing categories. We also describe how to use these statistics to assign schema-like information to the categories and show how the new categories improve a text-labeling algorithm. In effect, we provide a mechanism for successfully combining a hand-built lexicon with knowledge-free, statistically-derived information.
1 Introduction
Much effort is being applied to the creation of lexicons and the acquisition of semantic and syntactic attributes of the lexical items that comprise them, e.g., [1], [4], [7], [8], [11], [16], [18], [20]. However, a lexicon as given may not suit the requirements of a particular computational task. Because lexicons are expensive to build, rather than create new ones from scratch, it is preferable to adjust existing ones to meet an application's needs. In this paper we describe such an effort: we add associational information to a hierarchically structured lexicon in order to better serve a text labeling task.
An algorithm for partitioning a full-length expository text into a sequence of subtopical discussions is described in [9]. Once the partitioning is done, we need to assign labels¹ indicating what the subtopical discussions are about, for the purposes of information retrieval and hypertext navigation.
One way to label texts, when working within a limited domain of discourse, is to start with a pre-defined set of topics and specify the word contexts that indicate the topics of interest (e.g., [10]). Another way, assuming that a large collection of prelabeled texts exists, is to use statistics to automatically infer which lexical items indicate which labels (e.g., [12]). In contrast, we are interested in assigning labels to general, domain-independent text, without benefit of pre-classified texts. In all three cases, a lexicon that specifies which lexical items correspond to which topics is required. The topic labeling method we use is statistical and thus requires a large number of representative lexical items for each category. The starting point for our lexicon is WordNet [13], which is readily available online and provides a large repository of English lexical items. WordNet² is composed of synsets,
¹ The terms "label" and "topic" are used interchangeably in this paper.
² All work described here pertains to Version 1.3 of WordNet.
Similar resources
Customizing GermaNet for the Use in Deep Linguistic Processing
In this paper we show an approach to the customization of GermaNet to the German HPSG grammar lexicon developed in the Verbmobil project. GermaNet has a broad coverage of the German base vocabulary and fine-grained semantic classification, while the HPSG grammar lexicon is comparatively small and has a coarse-grained semantic classification. In our approach, we have developed a mapping algorith...
Sensicon: An Automatically Constructed Sensorial Lexicon
Connecting words with senses, namely, sight, hearing, taste, smell and touch, to comprehend the sensorial information in language is a straightforward task for humans by using commonsense knowledge. With this in mind, a lexicon associating words with senses would be crucial for the computational tasks aiming at interpretation of language. However, to the best of our knowledge, there is no syste...
A Study and Comparison of the Development of the Content Aspect of Word Definition Skill in 7- to 12-Year-Old Persian-Speaking Students
Objective: Language has three components: content, form, and pragmatics. The content component includes semantics. Semantic knowledge of word relationships requires awareness of the relationships between different words in the same field and in other fields. One of the main components of semantics is the mental lexicon that many of the semantic communications, including the organization and se...
Customizing Qualitative Spatial and Temporal Calculi
Qualitative spatial and temporal calculi are usually formulated on a particular level of granularity and with a particular domain of spatial or temporal entities. If the granularity or the domain of an existing calculus doesn’t match the requirements of an application, it is either possible to express all information using the given calculus or to customize the calculus. In this paper we distin...
Mental Representation of Cognates/Noncognates in Persian-Speaking EFL Learners
The purpose of this study was to investigate the mental representation of cognate and noncognate translation pairs in languages with different scripts to test the prediction of the dual lexicon model (Gollan, Forster, & Frost, 1997). Two groups of Persian-speaking English language learners were tested on cognate and noncognate translation pairs in Persian-English and English-Persian directions with...